Conversation

@dkimds dkimds commented Jul 1, 2025

Summary

Fixes #134 by adding debug mode support to benchmarks that were missing it.

Changes

  • ✅ AIME24: Added debug slicing [:5]
  • ✅ AIME25: Added debug slicing [:5]
  • ✅ AIW: Added debug slicing [:5]
  • ✅ AMC23: Added debug slicing [:5]
  • ✅ HMMT: Added debug slicing [:5]
  • ✅ MATH500: Added debug slicing [:5]

Testing

# Before: these tasks either failed or ignored the --debug flag
python -m eval.eval --model hf --tasks AIME24 --debug --model_args "pretrained=microsoft/DialoGPT-medium"

# After: all six tasks run with at most 5 examples
python -m eval.eval --model hf --tasks AIME24,AIME25,AIW,AMC23,HMMT,MATH500 --debug --model_args "pretrained=microsoft/DialoGPT-medium"

Implementation Pattern

Following the established pattern from MTBench and other working benchmarks:

if self.debug:
    examples = examples[:5]
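
For context, here is a minimal sketch of how that pattern typically sits inside a benchmark's data-loading step. The class and method names below (MathBenchmark, load_examples) and the JSONL format are illustrative placeholders, not the repository's actual API:

import json

class MathBenchmark:
    def __init__(self, data_path: str, debug: bool = False):
        # debug is typically wired up from the --debug CLI flag
        self.data_path = data_path
        self.debug = debug

    def load_examples(self) -> list:
        # Read one JSON object per line (JSONL assumed here for illustration)
        with open(self.data_path) as f:
            examples = [json.loads(line) for line in f]

        # The change in this PR: cap the dataset at 5 examples in debug mode
        if self.debug:
            examples = examples[:5]
        return examples

With the cap in place, each of the six benchmarks processes only a handful of items in debug mode, which is what makes the quick smoke-test commands above practical.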

Impact

✅ Consistent debug behavior across all benchmarks
✅ Faster development iteration (5 examples vs full dataset)
✅ Reduced compute costs during testing
✅ No breaking changes to existing functionality

Benchmarks affected: AIME24, AIME25, AIW, AMC23, HMMT, MATH500
@dkimds dkimds (Author) commented Jul 4, 2025

Hi @neginraoof, I noticed you recently reviewed a merged PR. Would you be able to take a look at my PR as well when you have some time? I’d really appreciate your feedback. Thanks!

@dkimds dkimds changed the title from "Add debug mode to 5 benchmarks" to "Add debug mode support to 6 benchmarks (AIME24, AIME25, AIW, AMC23, HMMT, MATH500)" on Jul 8, 2025
Successfully merging this pull request may close issue #134: Debug mode fails for AIME24 and 5 other benchmarks